date()
## [1] "Mon Nov 28 14:33:56 2022"

Assignment 4: Tasks and Instructions

2. Analysis (max 15 points)

First we install/load the R packages we need to complete the assignment.

#install.packages('ggplot2')
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(vtable)
## Warning: package 'vtable' was built under R version 4.2.2
## Loading required package: kableExtra
## Warning: package 'kableExtra' was built under R version 4.2.2
library(MASS)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.92 loaded
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

1. Create a new R Markdown

Create a new R Markdown file chapter4.Rmd and include it as a child file in your ‘index.Rmd’ file.


2. Boston data (0-1 points)

Load Boston data from MASS package. Explore structure, dimensions and describe the dataset briefly. (0-1 points)

library(MASS)
data("Boston") #download the Boston data
dim(Boston) #dimensions
## [1] 506  14
str(Boston) #structure
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Interpretation of the results

The Boston dataset has 506 rows (observations/data points) and 14 columns (variables):

  • crim: per capita crime rate by town.
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • black: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town.
  • lstat: lower status of the population (percent).
  • medv: median value of owner-occupied homes in $1000s.

All the variables are numeric, but some might be on an ordinal scale rather than an interval or ratio scale. A more detailed description of the dataset can be found in the MASS package documentation (?Boston).
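As a small sketch (base R only), the claims above — 506 × 14, all numeric, no missing values — can be verified directly:

```r
library(MASS)  # provides the Boston data frame
data("Boston")

dim(Boston)                      # 506 rows, 14 variables
sum(is.na(Boston))               # 0: no missing values
all(sapply(Boston, is.numeric))  # TRUE: every column is numeric (int or dbl)
```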


3. Graphical overview and variable summaries.(0-2 points)

  • Show a graphical overview
  • Show summaries of the variables in the data.
  • Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
gg = ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor",  size=2)), title="Graphical overview of Boston data")
gg1 = gg + theme(axis.text = element_text(size=5), strip.text.x = element_text(size = 5), strip.text.y = element_text(size=5)) #change the size of the text
gg1

Interpretation of the figure

There are quite many plots on display, so the figure is a bit hard to read:

  • Diagonal: the distribution of each variable.
  • Below the diagonal: scatter plots between pairs of variables.
  • Above the diagonal: correlations between pairs of variables.

Only a few variables, such as rm (average number of rooms per dwelling) and medv (median value of owner-occupied homes in $1000s), seem to be approximately normally distributed. All the other variables, except the dummy chas (Charles River dummy variable), are continuous but clearly non-normal. There seem to be several linear relationships between variables, but many non-linear relationships can be detected as well. For example, based on the distributions and scatter plots, crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq.ft.) and black (1000(Bk − 0.63)^2) seem to follow an exponential-looking rather than a normal distribution. Many of the correlations between the variables are weak, yet significant. However, if the variables are not normally distributed, Pearson's correlation is not the most informative way to explore the dataset.
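Since several variables are skewed, a rank-based measure such as Spearman's correlation may be more informative than Pearson's. A minimal sketch (base R), showing that the Spearman coefficient is simply Pearson's correlation computed on ranks:

```r
library(MASS)
data("Boston")

# Pearson vs. Spearman for a clearly skewed pair of variables
cor(Boston$crim, Boston$medv)                       # Pearson (default)
cor(Boston$crim, Boston$medv, method = "spearman")  # rank-based, robust to skew

# Spearman = Pearson on the ranked values
cor(rank(Boston$crim), rank(Boston$medv))
```

The resulting Spearman matrix, cor(Boston, method = "spearman"), can be passed to corrplot() just like the Pearson matrix below.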

Another way to illustrate the relationships between the variables is the corrplot() function from the corrplot package.

cor_matrix <- cor(Boston) 
corrplot(cor_matrix, method="circle")

Interpretation of the plot

The strength of the correlation is illustrated by the size and colour intensity of the circles:

  • red indicates a negative correlation
  • blue indicates a positive correlation
  • circle size and colour intensity indicate the strength of the correlation

As the plot shows, most relationships are weak at best. This is in line with the skewed distributions and non-linear relationships of most of the variables in the Boston dataset.

st(Boston) # the command prints a summary statistics table to Viewer-window
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
crim 506 3.614 8.602 0.006 0.082 3.677 88.976
zn 506 11.364 23.322 0 0 12.5 100
indus 506 11.137 6.86 0.46 5.19 18.1 27.74
chas 506 0.069 0.254 0 0 0 1
nox 506 0.555 0.116 0.385 0.449 0.624 0.871
rm 506 6.285 0.703 3.561 5.886 6.624 8.78
age 506 68.575 28.149 2.9 45.025 94.075 100
dis 506 3.795 2.106 1.13 2.1 5.188 12.126
rad 506 9.549 8.707 1 4 24 24
tax 506 408.237 168.537 187 279 666 711
ptratio 506 18.456 2.165 12.6 17.4 20.2 22
black 506 356.674 91.295 0.32 375.378 396.225 396.9
lstat 506 12.653 7.141 1.73 6.95 16.955 37.97
medv 506 22.533 9.197 5 17.025 25 50

Interpretation of the results

The summary table displays the summary statistics (n, mean, std, min, max, 25th and 75th percentiles). There are no missing values in the data set. The summary statistics can also hint at the distribution/skewness of the variables.

For example, we know that crim has an exponential-looking (right-skewed) distribution. The summary statistics support this:

  • min-max = 0.006-88.976,
  • yet mean is only 3.614, and
  • the 75% of the data (Pctl. 75) has a value less than 3.677

Whereas rm appears approximately normally distributed:

  • min-max = 3.561-8.78,
  • pctl.25 = 5.886,
  • mean = 6.285,
  • pctl.75 = 6.624

The 25th and 75th percentiles lie fairly symmetrically around the mean, which sits roughly midway between the minimum and maximum. To put the variables on a common scale we can standardize the data; note, however, that standardization changes only the location and spread of a variable, not the shape of its distribution.


4. Standardize Boston dataset (0-2 points)

  • Standardize the dataset
  • Print out summaries of the scaled data.
  • How did the variables change?
boston_scaled_KS <- as.data.frame(scale(Boston))
st(boston_scaled_KS)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
crim 506 0 1 -0.419 -0.411 0.007 9.924
zn 506 0 1 -0.487 -0.487 0.049 3.8
indus 506 0 1 -1.556 -0.867 1.015 2.42
chas 506 0 1 -0.272 -0.272 -0.272 3.665
nox 506 0 1 -1.464 -0.912 0.598 2.73
rm 506 0 1 -3.876 -0.568 0.482 3.552
age 506 0 1 -2.333 -0.837 0.906 1.116
dis 506 0 1 -1.266 -0.805 0.662 3.957
rad 506 0 1 -0.982 -0.637 1.66 1.66
tax 506 0 1 -1.313 -0.767 1.529 1.796
ptratio 506 0 1 -2.705 -0.488 0.806 1.637
black 506 0 1 -3.903 0.205 0.433 0.441
lstat 506 0 1 -1.53 -0.799 0.602 3.545
medv 506 0 1 -1.906 -0.599 0.268 2.987

When we standardize variables, we put them all on the same scale, which allows comparison across different types of variables. Standardizing sets the mean to 0 and the standard deviation to 1, as can be seen in the summary statistics. Negative values are below the (original) mean and positive values above it.
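A sketch making explicit what scale() does: each column is centered by its mean and divided by its standard deviation.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# scale() is equivalent to (x - mean(x)) / sd(x), column by column
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(boston_scaled$crim, manual_crim)  # TRUE

# after scaling, every column has mean 0 and standard deviation 1
range(round(colMeans(boston_scaled), 10))
range(apply(boston_scaled, 2, sd))
```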

  • Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate).
  • Use the quantiles as the break points in the categorical variable.
  • Drop the old crime rate variable from the dataset.
  • Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
# Create categorical variable
boston_scaled_KS$crim <- as.numeric(boston_scaled_KS$crim) #ensure crim is numeric (scaled quantiles: Pctl.25 = -0.411 and Pctl.75 = 0.007)

bins <- quantile(boston_scaled_KS$crim) #using quantile (counts min, 25%, 50% (median), 75% and 100%)
bins # print the break points (min, 25%, 50%, 75%, max), so they can be used as cut-offs
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

crime <- cut(boston_scaled_KS$crim, breaks = bins, labels=c("low", "med_low", "med_high", "high"), include.lowest = TRUE) #categorical variable of crime 
summary(crime) #you can see that variable now has 4 even categories (1=0-25%, 2=25%-50%, 3=50%-75%, 4=75%-100%)
##      low  med_low med_high     high 
##      127      126      126      127
# Drop crim and add crime
boston_scaled_KS <- dplyr::select(boston_scaled_KS, -crim) #discard the old crim variable using -crim in the scaled Boston data
glimpse(boston_scaled_KS) #sanity-check that crim has been deleted
## Rows: 506
## Columns: 13
## $ zn      <dbl> 0.28454827, -0.48724019, -0.48724019, -0.48724019, -0.48724019…
## $ indus   <dbl> -1.2866362, -0.5927944, -0.5927944, -1.3055857, -1.3055857, -1…
## $ chas    <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0…
## $ nox     <dbl> -0.1440749, -0.7395304, -0.7395304, -0.8344581, -0.8344581, -0…
## $ rm      <dbl> 0.4132629, 0.1940824, 1.2814456, 1.0152978, 1.2273620, 0.20689…
## $ age     <dbl> -0.11989477, 0.36680343, -0.26554897, -0.80908783, -0.51067434…
## $ dis     <dbl> 0.1400749840, 0.5566090496, 0.5566090496, 1.0766711351, 1.0766…
## $ rad     <dbl> -0.9818712, -0.8670245, -0.8670245, -0.7521778, -0.7521778, -0…
## $ tax     <dbl> -0.6659492, -0.9863534, -0.9863534, -1.1050216, -1.1050216, -1…
## $ ptratio <dbl> -1.4575580, -0.3027945, -0.3027945, 0.1129203, 0.1129203, 0.11…
## $ black   <dbl> 0.4406159, 0.4406159, 0.3960351, 0.4157514, 0.4406159, 0.41016…
## $ lstat   <dbl> -1.07449897, -0.49195252, -1.20753241, -1.36017078, -1.0254866…
## $ medv    <dbl> 0.15952779, -0.10142392, 1.32293748, 1.18158864, 1.48603229, 0…
boston_scaled_KS <- data.frame(boston_scaled_KS, crime) #add the new crime categorical variable
glimpse(boston_scaled_KS) #sanity-check that crime exist in the dataset
## Rows: 506
## Columns: 14
## $ zn      <dbl> 0.28454827, -0.48724019, -0.48724019, -0.48724019, -0.48724019…
## $ indus   <dbl> -1.2866362, -0.5927944, -0.5927944, -1.3055857, -1.3055857, -1…
## $ chas    <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0…
## $ nox     <dbl> -0.1440749, -0.7395304, -0.7395304, -0.8344581, -0.8344581, -0…
## $ rm      <dbl> 0.4132629, 0.1940824, 1.2814456, 1.0152978, 1.2273620, 0.20689…
## $ age     <dbl> -0.11989477, 0.36680343, -0.26554897, -0.80908783, -0.51067434…
## $ dis     <dbl> 0.1400749840, 0.5566090496, 0.5566090496, 1.0766711351, 1.0766…
## $ rad     <dbl> -0.9818712, -0.8670245, -0.8670245, -0.7521778, -0.7521778, -0…
## $ tax     <dbl> -0.6659492, -0.9863534, -0.9863534, -1.1050216, -1.1050216, -1…
## $ ptratio <dbl> -1.4575580, -0.3027945, -0.3027945, 0.1129203, 0.1129203, 0.11…
## $ black   <dbl> 0.4406159, 0.4406159, 0.3960351, 0.4157514, 0.4406159, 0.41016…
## $ lstat   <dbl> -1.07449897, -0.49195252, -1.20753241, -1.36017078, -1.0254866…
## $ medv    <dbl> 0.15952779, -0.10142392, 1.32293748, 1.18158864, 1.48603229, 0…
## $ crime   <fct> low, low, low, low, low, low, med_low, med_low, med_low, med_l…

NOTE.

The dollar sign is replaced with & below due to R Markdown syntax.

EDITED CODE: in the pre-exercise code of Exercise 4.5, directly AFTER the line that starts with boston_scaled <- read.table and ends with sep=",", header = T), you should ADD ONE LINE:

boston_scaled&crime <- factor(boston_scaled&crime, levels = c("low", "med_low", "med_high", "high"))

# Divide dataset into test (20%) and train (80%)
boston_scaled_KS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt", sep=",", header = T)

boston_scaled_KS$crime <- factor(boston_scaled_KS$crime, levels = c("low", "med_low", "med_high", "high"))

ind <- sample(nrow(boston_scaled_KS),  size = nrow(boston_scaled_KS) * 0.8) #randomly sample row indices for 80% of the observations
train_4 <- boston_scaled_KS[ind,] #80% of the dataset goes to train dataset
test_4 <- boston_scaled_KS[-ind,] #the remaining 20% (-ind) goes to test dataset
correct_classes_4 <- test_4$crime #save the correct classes from test data
test_4 <- dplyr::select(test_4, -crime) # remove the crime variable from test data
glimpse(test_4) #sanity-check: no crime, rows = 102, columns 13
## Rows: 102
## Columns: 13
## $ zn      <dbl> -0.48724019, -0.48724019, 0.04872402, 0.04872402, -0.48724019,…
## $ indus   <dbl> -0.5927944, -1.3055857, -0.4761823, -0.4761823, -0.4368257, -0…
## $ chas    <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0…
## $ nox     <dbl> -0.7395304, -0.8344581, -0.2648919, -0.2648919, -0.1440749, -0…
## $ rm      <dbl> 1.28144555, 0.20689164, -0.38802695, -0.93028528, -0.47769171,…
## $ age     <dbl> -0.26554897, -0.35080997, -0.07015919, 1.11638970, -0.24068118…
## $ dis     <dbl> 0.556609050, 1.076671135, 0.838414219, 1.086121629, 0.43332522…
## $ rad     <dbl> -0.8670245, -0.7521778, -0.5224844, -0.5224844, -0.6373311, -0…
## $ tax     <dbl> -0.9863534, -1.1050216, -0.5769480, -0.5769480, -0.6006817, -0…
## $ ptratio <dbl> -0.30279450, 0.11292035, -1.50374851, -1.50374851, 1.17530274,…
## $ black   <dbl> 0.39603507, 0.41016511, 0.42637632, 0.32812326, 0.44061589, 0.…
## $ lstat   <dbl> -1.20753241, -1.04229087, -0.03123671, 2.41937935, -0.61518350…
## $ medv    <dbl> 1.32293748, 0.67055821, 0.03992492, -0.65594629, -0.23189977, …
glimpse(train_4) #sanity-check: crime, rows = 404, columns 14
## Rows: 404
## Columns: 14
## $ zn      <dbl> -0.4872402, -0.4872402, -0.4872402, -0.4872402, -0.4872402, -0…
## $ indus   <dbl> 1.0149946, 1.2307270, -0.1642450, -0.1642450, 1.2307270, 2.115…
## $ chas    <dbl> -0.2723291, 3.6647712, -0.2723291, -0.2723291, -0.2723291, -0.…
## $ nox     <dbl> 0.25289548, 0.43412107, -0.06640675, -0.06640675, 0.43412107, …
## $ rm      <dbl> -0.40083620, 2.97511331, -0.52892872, -0.27416693, -0.57589598…
## $ age     <dbl> 0.92099991, 0.89968466, 0.86415924, 0.95297278, 1.02047107, 0.…
## $ dis     <dbl> -0.595876266, -0.775530624, -0.684634922, -0.592219542, -0.667…
## $ rad     <dbl> 1.6596029, -0.5224844, -0.4076377, -0.4076377, -0.5224844, -0.…
## $ tax     <dbl> 1.52941294, -0.03107419, 0.14099473, 0.14099473, -0.03107419, …
## $ ptratio <dbl> 0.8057784, -1.7347012, -0.3027945, -0.3027945, -1.7347012, 0.2…
## $ black   <dbl> -0.27804446, 0.34805866, 0.41925653, 0.44061589, -0.09358721, …
## $ lstat   <dbl> 1.21367625, -1.30695741, 0.49809636, 0.62132734, -0.08725079, …
## $ medv    <dbl> -0.37324861, 2.98650460, -0.40586757, -0.41674056, -0.37324861…
## $ crime   <fct> high, med_high, med_low, med_low, med_high, low, high, med_low…
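One caveat: the sample() call above is not seeded, so the split (and every result downstream) changes on each knit. A sketch of a reproducible 80/20 split, assuming the same scaled data (the seed value 2022 is an arbitrary choice):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(2022)  # fix the RNG so the split is identical on every knit
n   <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))

train <- boston_scaled[ind, ]   # 404 rows (80%)
test  <- boston_scaled[-ind, ]  # 102 rows (20%)
```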

5. Linear discriminant analysis (LDA) (0-3 points)

  • Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables (.) in the dataset as predictor variables.
  • Draw the LDA (bi)plot.

About the method: LDA is often used in statistics to separate groups or classes (minimum 2) based on different characteristics of the data. LDA has continuous independent variables and a categorical dependent variable. Where factor analysis creates factors based on similarities in the data (and, especially in exploratory factor analysis, the groupings are data-driven), LDA finds combinations of variables that best separate predefined groups. This also distinguishes LDA from cluster analysis, where the groups are not defined in advance. See more details: Linear discriminant analysis (LDA).

Here, the groups are already defined: low, med_low, med_high, high. We want to see how these groups differ based on the other variables in the data.

R code

lda.fit_train <- lda(crime ~ ., data = train_4) # linear discriminant analysis (LDA), "." indicates all other variables
lda.fit_train
## Call:
## lda(crime ~ ., data = train_4)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2376238 0.2549505 0.2475248 0.2599010 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       1.07040579 -0.9518017 -0.06727176 -0.9000804  0.48503328 -0.9016759
## med_low  -0.04598034 -0.3372243 -0.08120770 -0.6020486 -0.09142437 -0.4342430
## med_high -0.36546912  0.1893811  0.31823597  0.3657731  0.20774559  0.4376766
## high     -0.48724019  1.0170492 -0.04735191  1.0422336 -0.43271701  0.8168935
##                 dis        rad        tax     ptratio      black       lstat
## low       0.9386941 -0.6815949 -0.7262722 -0.46013102  0.3747316 -0.77848497
## med_low   0.4127049 -0.5358646 -0.4730269 -0.09785201  0.3149397 -0.22197009
## med_high -0.3885795 -0.4283101 -0.3220487 -0.31341832  0.1461196 -0.06736579
## high     -0.8633897  1.6388211  1.5145512  0.78158339 -0.7479170  0.88715417
##                 medv
## low       0.56409354
## med_low   0.04393632
## med_high  0.24237996
## high     -0.67572477
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.11442689  0.68184458 -1.14766565
## indus    0.10497753 -0.30213247  0.20948718
## chas    -0.10746203 -0.07191690 -0.02918765
## nox      0.38699548 -0.79233220 -1.21779106
## rm      -0.13097186 -0.14999541 -0.20663793
## age      0.17653794 -0.37808265 -0.28733483
## dis     -0.07238127 -0.28951234  0.20365790
## rad      3.38648550  0.90966797 -0.10521034
## tax     -0.03691444  0.05512577  0.62191235
## ptratio  0.10721692  0.03263558 -0.34643007
## black   -0.12699946  0.03741652  0.12316562
## lstat    0.22636243 -0.06071022  0.34163569
## medv     0.23039780 -0.25545986 -0.11245266
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9490 0.0377 0.0133

Interpretation of the results

Interpretation of the results is adapted from Linear Discriminant Analysis in R - finnstats and Discriminant Analysis Essentials in R.

  • Prior probabilities of groups: all the groups are of roughly even size, approx. 25% each (as expected from the quartile-based categories)
  • Group means: the mean of each variable within each category. For example,
    • In zn (proportion of residential land zoned for lots over 25,000 sq.ft.), all groups except low crime have negative values. Since the data is standardized, positive values indicate values above the mean, so areas with a low crime rate have a larger proportion of residential land zoned for large lots than areas with higher crime rates.
    • In medv (median value of owner-occupied homes in $1000s), only areas with a "high" crime rate have a negative value. This means that areas with a high crime rate have, on average, lower median home values than other areas.
  • Coefficients of linear discriminants: with four groups, LDA produces at most three discriminant functions (LD1-LD3). The coefficients give the linear combination of predictors that forms each discriminant. For example
    • LD1 ≈ 0.11·zn + 0.10·indus − 0.11·chas + 0.39·nox − 0.13·rm + 0.18·age − 0.07·dis + 3.39·rad − 0.04·tax + 0.11·ptratio − 0.13·black + 0.23·lstat + 0.23·medv
    • In separating the groups, rad (index of accessibility to radial highways) has by far the largest coefficient (3.39), followed by nox (0.39), medv (0.23) and lstat (0.23).
  • Proportion of trace: the share of the between-group separation captured by each discriminant function; LD1 captures 95%, LD2 4% and LD3 1%
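To make the role of the coefficients concrete, the LD scores are just a matrix product of the centered predictors with the coefficient matrix (fit$scaling). A sketch, fitted here on the full scaled data rather than the random train split (so the exact numbers differ slightly from the output above):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# rebuild the quantile-based crime categories, as in the text
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           labels = c("low", "med_low", "med_high", "high"),
                           include.lowest = TRUE)
boston_scaled$crim <- NULL

fit <- lda(crime ~ ., data = boston_scaled)

# LD scores = (X - column means) %*% coefficient matrix
X <- as.matrix(boston_scaled[, names(boston_scaled) != "crime"])
manual_scores <- sweep(X, 2, colMeans(X)) %*% fit$scaling

max(abs(manual_scores - predict(fit)$x))  # ~0: matches predict()
```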

Using the function plot() produces a biplot of the linear discriminants, obtained by computing LD1 and LD2 for each of the training observations.

lda.arrows_train <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes_4 <- as.numeric(train_4$crime) # target classes as numeric; 4 classes: low, med_low, med_high, high

# plot the lda results
plot(lda.fit_train, dimen = 2, col = classes_4, pch = classes_4)
lda.arrows_train(lda.fit_train, myscale = 1)

Interpretation of the plot

Consistent with the coefficients of linear discriminants in the previous table, rad (index of accessibility to radial highways) seems to have the most impact in differentiating the clusters. Based on the plot, the coefficients and the group means, rad values differ clearly between the high crime rate category and the other categories. In other words, good access to radial highways seems to be most strongly associated with a high crime rate.


6. Predict LDA model (0-3 points)

  • Save the crime categories from the test set and remove the variable
  • Predict the classes with the LDA model on the test data
  • Cross tabulate the results with the crime categories from the test set
  • Comment on the results

The first two steps have already been done: the test data set no longer contains the "crime" variable.

# The first two steps have already been done. The test dataset does not contain "crime" variable.
glimpse(test_4) #sanity check, no crime
## Rows: 102
## Columns: 13
## $ zn      <dbl> -0.48724019, -0.48724019, 0.04872402, 0.04872402, -0.48724019,…
## $ indus   <dbl> -0.5927944, -1.3055857, -0.4761823, -0.4761823, -0.4368257, -0…
## $ chas    <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0…
## $ nox     <dbl> -0.7395304, -0.8344581, -0.2648919, -0.2648919, -0.1440749, -0…
## $ rm      <dbl> 1.28144555, 0.20689164, -0.38802695, -0.93028528, -0.47769171,…
## $ age     <dbl> -0.26554897, -0.35080997, -0.07015919, 1.11638970, -0.24068118…
## $ dis     <dbl> 0.556609050, 1.076671135, 0.838414219, 1.086121629, 0.43332522…
## $ rad     <dbl> -0.8670245, -0.7521778, -0.5224844, -0.5224844, -0.6373311, -0…
## $ tax     <dbl> -0.9863534, -1.1050216, -0.5769480, -0.5769480, -0.6006817, -0…
## $ ptratio <dbl> -0.30279450, 0.11292035, -1.50374851, -1.50374851, 1.17530274,…
## $ black   <dbl> 0.39603507, 0.41016511, 0.42637632, 0.32812326, 0.44061589, 0.…
## $ lstat   <dbl> -1.20753241, -1.04229087, -0.03123671, 2.41937935, -0.61518350…
## $ medv    <dbl> 1.32293748, 0.67055821, 0.03992492, -0.65594629, -0.23189977, …

Use the model fitted on the training data (lda.fit_train) to predict the classes in the test data (test_4).

lda.pred_test <- predict(lda.fit_train, newdata = test_4) # predict classes with test data
table(correct = correct_classes_4, predicted = lda.pred_test$class) # cross tabulate the results
##           predicted
## correct    low med_low med_high high
##   low       13      16        2    0
##   med_low    1      16        6    0
##   med_high   0      10       14    2
##   high       0       0        0   22

Interpretation of the results

Overall, using the test data the model predicts 65 of the 102 values correctly (about 64%).

  • The model predicts the high category perfectly (high → high: 22 out of 22)
  • med_high values were predicted correctly (med_high → med_high) 14 times (54%); 38% were predicted as med_low
  • med_low values were predicted correctly (med_low → med_low) 16 times (70%); 26% were predicted as med_high
  • low values were predicted correctly (low → low) 13 times (42%); 52% were predicted as med_low
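These figures can be computed directly from the confusion matrix: overall accuracy is the sum of the diagonal over the total, and per-class accuracy is the diagonal over the row totals. A sketch using the counts printed above:

```r
# confusion matrix as printed above (rows = correct, cols = predicted)
conf <- matrix(c(13, 16,  2,  0,
                  1, 16,  6,  0,
                  0, 10, 14,  2,
                  0,  0,  0, 22),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

sum(diag(conf))                       # 65 correct predictions
sum(diag(conf)) / sum(conf)           # overall accuracy, about 0.64
round(diag(conf) / rowSums(conf), 2)  # per-class accuracy
```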

7. K-means (0-4 points)

  • Reload the Boston dataset and standardize the dataset.
  • Calculate the distances between the observations.
  • Run k-means algorithm on the dataset.
  • Investigate what is the optimal number of clusters and run the algorithm again.
  • Visualize the clusters using scatterplot
  • Interpret the results

Reload the Boston dataset

data("Boston")
boston_scl <- as.data.frame(scale(Boston))
st(boston_scl)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
crim 506 0 1 -0.419 -0.411 0.007 9.924
zn 506 0 1 -0.487 -0.487 0.049 3.8
indus 506 0 1 -1.556 -0.867 1.015 2.42
chas 506 0 1 -0.272 -0.272 -0.272 3.665
nox 506 0 1 -1.464 -0.912 0.598 2.73
rm 506 0 1 -3.876 -0.568 0.482 3.552
age 506 0 1 -2.333 -0.837 0.906 1.116
dis 506 0 1 -1.266 -0.805 0.662 3.957
rad 506 0 1 -0.982 -0.637 1.66 1.66
tax 506 0 1 -1.313 -0.767 1.529 1.796
ptratio 506 0 1 -2.705 -0.488 0.806 1.637
black 506 0 1 -3.903 0.205 0.433 0.441
lstat 506 0 1 -1.53 -0.799 0.602 3.545
medv 506 0 1 -1.906 -0.599 0.268 2.987
# Create categorical variable
boston_scl$crim <- as.numeric(boston_scl$crim)
bins_scl <- quantile(boston_scl$crim) #using quantile (counts min, 25%, 50% (median), 75% and 100%)
bins_scl # print the break points, so they can be used as cut-offs
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

crime_scl <- cut(boston_scl$crim, breaks = bins_scl, labels=c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
summary(crime_scl) #you can see that variable now has 4 even categories (low=127, med_low=126, med_high=126, high=127)
##      low  med_low med_high     high 
##      127      126      126      127
# Drop crim and add crime
boston_scl <- dplyr::select(boston_scl, -crim) #discard the old crim variable using -crim in the scaled Boston data
boston_scl <- data.frame(boston_scl, crime_scl) #add the new crime categorical variable
st(boston_scl) #sanity-check
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
zn 506 0 1 -0.487 -0.487 0.049 3.8
indus 506 0 1 -1.556 -0.867 1.015 2.42
chas 506 0 1 -0.272 -0.272 -0.272 3.665
nox 506 0 1 -1.464 -0.912 0.598 2.73
rm 506 0 1 -3.876 -0.568 0.482 3.552
age 506 0 1 -2.333 -0.837 0.906 1.116
dis 506 0 1 -1.266 -0.805 0.662 3.957
rad 506 0 1 -0.982 -0.637 1.66 1.66
tax 506 0 1 -1.313 -0.767 1.529 1.796
ptratio 506 0 1 -2.705 -0.488 0.806 1.637
black 506 0 1 -3.903 0.205 0.433 0.441
lstat 506 0 1 -1.53 -0.799 0.602 3.545
medv 506 0 1 -1.906 -0.599 0.268 2.987
crime_scl 506
… low 127 25.1%
… med_low 126 24.9%
… med_high 126 24.9%
… high 127 25.1%
boston_scl$crime_scl <- factor(boston_scl$crime_scl, levels = c("low", "med_low", "med_high", "high"))

Compute distances: the Euclidean distance matrix is the default; to use the Manhattan distance, specify method="manhattan".

# Distances between the observations
# ?dist 
dist_eu_scl <- dist(boston_scl)
## Warning in dist(boston_scl): NAs introduced by coercion
summary(dist_eu_scl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1394  3.5267  4.9081  4.9394  6.2421 13.0045
dist_man_scl <- dist(boston_scl, method="manhattan")
## Warning in dist(boston_scl, method = "manhattan"): NAs introduced by coercion
summary(dist_man_scl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2849  8.8386 13.0300 13.8657 18.0953 46.8948

NOTE. I got a warning Warning: NAs introduced by coercion when crime_scl was a factor; no warning occurred when it was numeric.

Treat crime_scl as numeric rather than a factor.

# Distances between the observations
boston_scl$crime_scl <- as.numeric(boston_scl$crime_scl)

dist_eu_scl <- dist(boston_scl)
summary(dist_eu_scl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.5760  4.9401  4.9913  6.3033 12.8856
dist_man_scl <- dist(boston_scl, method="manhattan")
summary(dist_man_scl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2645  8.9648 13.2765 14.1297 18.5263 46.5452

Interpretation of the results

The first summary is for the Euclidean distance and the second for the Manhattan distance. A distance measures the dissimilarity between vectors: the dist() function calculates the distances between the rows of a matrix (here boston_scl). Larger values indicate greater differences between the rows, and therefore possibly some distinct patterns between groups.

In general, I have never computed distance matrices before, so I am not very familiar with this approach. I found this page quite useful for explaining it. If I understood correctly, the distance is an arbitrary number, meaning it has no fixed scale (unlike correlation: −1 to 1). A distance of zero means the points are identical, whereas high values indicate little similarity.

I also found this page helpful to explain the basics.

  • If I understood correctly, the Euclidean distance is measured along a straight line, whereas the Manhattan distance follows a grid-like path (similar to city blocks) and hence gives overall higher values.
  • Based on the summaries, we have observations that are very similar to each other (min = 0.13 and min_manhattan = 0.26) and observations that are very different (max = 12.89 and max_manhattan = 46.55).
  • The 1st quartile, median and mean are relatively similar (and roughly also the 3rd quartile), which might suggest that most observations are relatively close together.
  • But since the max is so much higher (almost double), we might have either outliers or actual groups with larger distances.
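The two metrics can be checked by hand on a toy pair of points; a minimal base-R sketch:

```r
# two points in 2-D: (0, 0) and (3, 4)
pts <- rbind(c(0, 0), c(3, 4))

# Euclidean: straight-line distance, sqrt(3^2 + 4^2) = 5
dist(pts)                        # 5

# Manhattan: grid path, |3| + |4| = 7 (never shorter than Euclidean)
dist(pts, method = "manhattan")  # 7
```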

K-means algorithm

NOTE. I used the standardized dataset (boston_scl), not the raw dataset as in the exercise.

# Investigate what is the optimal number of clusters and run the algorithm again
set.seed(123) #fix the random starting points of kmeans() so the clustering is reproducible; otherwise each run could give different clusters

k_max_scl <- 10 #define max amount of clusters
twcss_scl <- sapply(1:k_max_scl, function(k){kmeans(boston_scl, k)$tot.withinss}) # calculate the total within sum of square
qplot(x = 1:k_max_scl, y = twcss_scl, geom = 'line', xlim = c(1,10)) # visualize the results

Interpretation of the plot

The optimal number of clusters is where the total WCSS drops sharply (the "elbow"). Based on the plot, the optimal number of clusters would be 2-4.

Next, we generate and plot these models and then choose which one is the most suitable.

Choosing the best model & number of clusters

km_2 <- kmeans(boston_scl, centers = 2) # 2 indicates the number of clusters we are generating
km_3 <- kmeans(boston_scl, centers = 3)
km_4 <- kmeans(boston_scl, centers = 4)
# plot the scaled Boston dataset with clusters (all variables)
pairs(boston_scl, col = km_2$cluster) #2 clusters: red and black

pairs(boston_scl, col = km_3$cluster) #3 clusters: red, black and green

pairs(boston_scl, col = km_4$cluster) #4 clusters: red, black, green and blue  

Interpretation of the plots

We try to find the most optimal number of clusters (minimal overlapping) and detect some variables that seem to have the most distinct clusters.

  • Two clusters (km_2): There is minimal overlap between the 2 clusters. The most distinct differences can be detected for rad, black, lstat and medv.
  • Three clusters (km_3): The different colours can still be grouped into their own distinct clusters, without much overlap. indus, nox, rm, age, rad and tax seem to have the most distinct groups.
  • Four clusters (km_4): With 4 clusters, the last cluster (black) overlaps with the other clusters.

Next, we can have a look at the same plots, focusing only on columns 5-10 (nox, rm, age, dis, rad and tax):

pairs(boston_scl[5:10], col = km_2$cluster) #2 clusters: red and black

pairs(boston_scl[5:10], col = km_3$cluster) #3 clusters: red, black and green

pairs(boston_scl[5:10], col = km_4$cluster) #4 clusters: red, black, green and blue  

Interpretation of the plot

  • Two clusters (km_2): tax and ptratio seem to have less distinct patterns with other variables.
  • Three clusters (km_3): tax and ptratio seem to have less distinct patterns with rad
  • Four clusters (km_4): the 4th cluster (black) overlaps with the other clusters.

Therefore, a model with 2-3 clusters may be the most optimal for describing the data.
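Beyond eyeballing the pairs plots, cluster quality can also be compared with the between/total sum-of-squares ratio that kmeans() reports. A self-contained sketch on the built-in mtcars data (used here only as a stand-in for boston_scl):

```r
# Compare the share of variance explained by the clustering for k = 2..4.
# mtcars is a stand-in; the same sapply() pattern would apply to boston_scl.
set.seed(123)
dat <- scale(mtcars)  # standardize, as with the Boston data
ratios <- sapply(2:4, function(k) {
  km <- kmeans(dat, centers = k, nstart = 10)
  km$betweenss / km$totss  # share of total variance explained by the clusters
})
names(ratios) <- paste0("k=", 2:4)
round(ratios, 2)  # the ratio grows with k; look for diminishing gains
```

The ratio always increases as k grows, so the point is not to maximize it but to see where the gains level off, mirroring the elbow plot.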


Bonus: to compensate any loss of points from the above exercises (0-2 points)

  • Perform k-means on the original Boston data with some reasonable number of clusters (> 2).
  • Remember to standardize the dataset
  • Then perform LDA using the clusters as target classes.
  • Include all the variables in the Boston data in the LDA model.
  • Visualize the results with a biplot including the arrows representing the relationships between original variables and LDA.
  • Interpret the results.
  • Which variables are the most influential linear separators for the clusters?

Read and standardize

data("Boston")
boston_bonus <- as.data.frame(scale(Boston))
st(boston_bonus) #sanity check
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
crim 506 0 1 -0.419 -0.411 0.007 9.924
zn 506 0 1 -0.487 -0.487 0.049 3.8
indus 506 0 1 -1.556 -0.867 1.015 2.42
chas 506 0 1 -0.272 -0.272 -0.272 3.665
nox 506 0 1 -1.464 -0.912 0.598 2.73
rm 506 0 1 -3.876 -0.568 0.482 3.552
age 506 0 1 -2.333 -0.837 0.906 1.116
dis 506 0 1 -1.266 -0.805 0.662 3.957
rad 506 0 1 -0.982 -0.637 1.66 1.66
tax 506 0 1 -1.313 -0.767 1.529 1.796
ptratio 506 0 1 -2.705 -0.488 0.806 1.637
black 506 0 1 -3.903 0.205 0.433 0.441
lstat 506 0 1 -1.53 -0.799 0.602 3.545
medv 506 0 1 -1.906 -0.599 0.268 2.987
# Create categorical variable
boston_bonus$crim <- as.numeric(boston_bonus$crim)
bins_b <- quantile(boston_bonus$crim) #using quantile (counts min, 25%, 50% (median), 75% and 100%)
bins_b # print the break points; these quantiles serve as the cut-offs
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
st(boston_bonus)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
crim 506 0 1 -0.419 -0.411 0.007 9.924
zn 506 0 1 -0.487 -0.487 0.049 3.8
indus 506 0 1 -1.556 -0.867 1.015 2.42
chas 506 0 1 -0.272 -0.272 -0.272 3.665
nox 506 0 1 -1.464 -0.912 0.598 2.73
rm 506 0 1 -3.876 -0.568 0.482 3.552
age 506 0 1 -2.333 -0.837 0.906 1.116
dis 506 0 1 -1.266 -0.805 0.662 3.957
rad 506 0 1 -0.982 -0.637 1.66 1.66
tax 506 0 1 -1.313 -0.767 1.529 1.796
ptratio 506 0 1 -2.705 -0.488 0.806 1.637
black 506 0 1 -3.903 0.205 0.433 0.441
lstat 506 0 1 -1.53 -0.799 0.602 3.545
medv 506 0 1 -1.906 -0.599 0.268 2.987

crime_b <- cut(boston_bonus$crim, breaks = bins_b, labels=c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
summary(crime_b) #you can see that variable now has 4 even categories (low=127, med_low=126, med_high=126, high=127)
##      low  med_low med_high     high 
##      127      126      126      127
boston_bonus <- dplyr::select(boston_bonus, -crim) #discard the old crim variable using -crim in the scaled Boston data
boston_bonus <- data.frame(boston_bonus, crime_b) #add the new crime categorical variable
boston_bonus$crime_b <- factor(boston_bonus$crime_b, levels = c("low", "med_low", "med_high", "high"))
st(boston_bonus)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
zn 506 0 1 -0.487 -0.487 0.049 3.8
indus 506 0 1 -1.556 -0.867 1.015 2.42
chas 506 0 1 -0.272 -0.272 -0.272 3.665
nox 506 0 1 -1.464 -0.912 0.598 2.73
rm 506 0 1 -3.876 -0.568 0.482 3.552
age 506 0 1 -2.333 -0.837 0.906 1.116
dis 506 0 1 -1.266 -0.805 0.662 3.957
rad 506 0 1 -0.982 -0.637 1.66 1.66
tax 506 0 1 -1.313 -0.767 1.529 1.796
ptratio 506 0 1 -2.705 -0.488 0.806 1.637
black 506 0 1 -3.903 0.205 0.433 0.441
lstat 506 0 1 -1.53 -0.799 0.602 3.545
medv 506 0 1 -1.906 -0.599 0.268 2.987
crime_b 506
… low 127 25.1%
… med_low 126 24.9%
… med_high 126 24.9%
… high 127 25.1%

K-means

#k-means
boston_bonus$crime_b <- as.numeric(boston_bonus$crime_b) # kmeans() needs numeric input, so convert the factor back to numeric
set.seed(123) 
k_max <- 10
twcss_b <- sapply(1:k_max, function(k){kmeans(boston_bonus, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss_b, geom = 'line', xlim = c(1,10))

#based on the plot 2-3 clusters are the most optimal

Interpretation of the plot

A model with 2-3 clusters seems to describe our data the best.

bkm_2 <- kmeans(boston_bonus, centers = 2)
pairs(boston_bonus, col = bkm_2$cluster) 

bkm_3 <- kmeans(boston_bonus, centers = 3)
pairs(boston_bonus, col = bkm_3$cluster)

#based on the plots, the model with 2 clusters seems more optimal: the third cluster gives very similar results to the other two clusters

Interpretation of the plot

Therefore, the model with 2 clusters may be the most optimal for describing our data. However, since the guidelines require more than 2 clusters ("Perform k-means on the original Boston data with some reasonable number of clusters (> 2)"), we run the following model with 3 clusters.

Run LDA

For some reason, when I tried to knit Assignment 4, this no longer worked; instead it gave the error: Error in x - group.means[g, ] : non-conformable arrays. I was unsure how to fix it. Below is a screenshot of the model from when it did work.

Here is the code that previously worked (dollar signs restored; in the knit they had been replaced with & to work around an RMarkdown syntax issue):

boston_bonus <- data.frame(boston_bonus, crime_b) #add the new crime categorical variable
boston_bonus$crime_b <- factor(boston_bonus$crime_b, levels = c("low", "med_low", "med_high", "high"))

library(MASS)

boston_bonus$crime_b <- as.numeric(boston_bonus$crime_b)

lda.fit_b <- lda(bkm_3$cluster ~ ., data = boston_bonus) #LDA using the clusters as target classes

lda.fit_b
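Since the knit failed at this step, here is a minimal, self-contained sketch of the same pipeline on toy data (mtcars continuous columns instead of the assignment objects; my own sketch, not the course solution), which shows the intended shape of the result:

```r
library(MASS)

set.seed(123)
dat <- as.data.frame(scale(mtcars[, 1:7]))  # standardized toy data, continuous variables only
km <- kmeans(dat, centers = 3)              # 3 clusters, as with bkm_3
dat$cluster <- factor(km$cluster)           # keep target and predictors in one data frame
fit <- lda(cluster ~ ., data = dat)         # clusters as LDA target classes
fit$counts   # observations per cluster
fit$scaling  # 7 predictors x 2 discriminants (3 classes give at most 2 LDs)
```

Keeping the cluster labels as a column of the same data frame avoids the shape mismatch that arises when the target vector and the data were built from different versions of the dataset.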

Picture 1

Interpretation of the results

lda.arrows_b <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}

classes_b <- as.numeric(bkm_3$cluster) # target classes as numeric

# plot the lda results
plot(lda.fit_b, dimen = 2, col = classes_b, pch = classes_b)

lda.arrows_b(lda.fit_b, myscale = 1)

Picture 2

Interpretation of the plot

zn (proportion of residential land zoned for lots over 25,000 sq.ft.) seems to be the most influential linear separator between the clusters, followed by age, crime, tax and nox.


Super-Bonus: Plotly() 3D plot (0-3 points) (to compensate any loss of points from the above exercises)

  • Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
  • Install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
  • Adjust the code: add color as an argument in the plot_ly() function. Set the color to be the crime classes of the train set.
  • Draw another 3D plot where the color is defined by the clusters of the k-means.
  • How do the plots differ? Are there any similarities?
model_predictors_sb <- dplyr::select(train_4, -crime) #the LD values for the first 3D plot is generated by train_4 data (80%).
# check the dimensions
dim(model_predictors_sb)
## [1] 404  13
dim(lda.fit_train$scaling)
## [1] 13  3
# matrix multiplication
matrix_product_sb <- as.matrix(model_predictors_sb) %*% lda.fit_train$scaling
matrix_product_sb <- as.data.frame(matrix_product_sb)
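The projection works because the matrix shapes line up: the (404 x 13) predictor matrix times the (13 x 3) scaling matrix yields one (LD1, LD2, LD3) coordinate per observation. A toy check with the same shapes (random stand-in matrices, not the assignment data):

```r
# Dimension sanity check for the projection: ncol(X) must equal nrow(W).
set.seed(1)
X <- matrix(rnorm(404 * 13), nrow = 404)  # stand-in for the scaled train predictors
W <- matrix(rnorm(13 * 3), nrow = 13)     # stand-in for lda.fit_train$scaling
P <- X %*% W
dim(P)  # 404 3: one LD1/LD2/LD3 coordinate per training observation
```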

Install and access plotly and draw 3D plot

#install.packages("plotly")
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product_sb$LD1, y = matrix_product_sb$LD2, z = matrix_product_sb$LD3, type= 'scatter3d', mode='markers', color = train_4$crime)

3D plot based on k-means The second plot needs to be coloured by the k-means clusters. However, the train_4 data (80%) was not used to calculate the k-means; instead, based on the assignment, we needed to use the complete Boston dataset. The standardized datasets are:

  • Assignment 7 (k-means and graphs): data = boston_scl, cluster objects = km_2, km_3, km_4
  • Bonus assignment: data = boston_bonus, cluster objects = bkm_2, bkm_3
data("Boston")
boston_sb <- as.data.frame(scale(Boston))
boston_sb$crim <- as.numeric(boston_sb$crim) # Create categorical variable
bins_sb <- quantile(boston_sb$crim)
crime_sb <- cut(boston_sb$crim, breaks = bins_sb, labels=c("low", "med_low", "med_high", "high"), include.lowest = TRUE)

boston_sb <- dplyr::select(boston_sb, -crim) 
boston_sb <- data.frame(boston_sb, crime_sb) 
boston_sb$crime_sb <- as.numeric(boston_sb$crime_sb)
sbkm_3 <- kmeans(boston_sb, centers = 3)
boston_sb$crim <- as.numeric(boston_sb$crim) # NOTE: crim was already removed above, so $crim partially matches crime_sb and this silently re-creates crim as a duplicate of that column
lda.fit_sb <- lda(sbkm_3$cluster ~ ., data = boston_sb) #LDA using the clusters as target classes
## Warning in lda.default(x, grouping, ...): variables are collinear
lda.fit_sb
## Call:
## lda(sbkm_3$cluster ~ ., data = boston_sb)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.1897233 0.4762846 0.3339921 
## 
## Group means:
##           zn      indus         chas        nox          rm         age
## 1  1.6956974 -1.0817907 -0.067271764 -1.1173803  0.66096776 -1.31662020
## 2 -0.3337899 -0.3632658  0.021728211 -0.3404473  0.05360582 -0.03985203
## 3 -0.4872402  1.1325382  0.007228345  1.1202149 -0.45190478  0.80473300
##           dis        rad        tax    ptratio      black      lstat       medv
## 1  1.40553832 -0.6193863 -0.6437607 -0.6631769  0.3585763 -0.9180096  0.7410061
## 2  0.03865169 -0.5710917 -0.6216579 -0.1578980  0.2878179 -0.2319868  0.1757696
## 3 -0.85353099  1.1662378  1.2521927  0.6018841 -0.6141269  0.8522943 -0.6715802
##   crime_sb     crim
## 1 1.281250 1.281250
## 2 2.132780 2.132780
## 3 3.715976 3.715976
## 
## Coefficients of linear discriminants:
##                   LD1         LD2
## zn       -0.121076625  1.40439032
## indus     0.815181218  0.37919114
## chas     -0.053348063  0.02575069
## nox       1.213770032  0.59654110
## rm       -0.013376446  0.16375801
## age      -0.007550386 -0.49308296
## dis      -0.021956895  0.49261234
## rad       0.630308088  0.29865470
## tax       0.886895618  0.64969738
## ptratio   0.330976614  0.15027688
## black    -0.002217404 -0.04048125
## lstat     0.260179025  0.22374889
## medv      0.149859384  0.23550344
## crime_sb -0.054851160 -0.15575154
## crim     -0.054851160 -0.15575154
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8426 0.1574
model_predictors_sb <- dplyr::select(boston_sb, -crime_sb)
dim(model_predictors_sb)
## [1] 506  14
dim(lda.fit_sb$scaling)
## [1] 15  2
boston_sb$crime_sb <- factor(boston_sb$crime_sb, levels = c("low", "med_low", "med_high", "high"))
boston_sb$crime_sb <- as.numeric(boston_sb$crime_sb)
# matrix_product_sb <- as.matrix(model_predictors_sb) %*% lda.fit_sb$scaling

The last line, matrix_product_sb <- as.matrix(...), gives the error:

Error in as.matrix(model_predictors_sb) %*% lda.fit_sb$scaling : non-conformable arguments

Install and access plotly and draw 3D plot

#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product_sb$LD1, y = matrix_product_sb$LD2, z = matrix_product_sb$LD3, type= 'scatter3d', mode='markers', color = sbkm_3$cluster) # colour by the k-means clusters (boston_sb itself has no 'clusters' column)

Interpret results

The dimension check above shows why this fails: the predictor matrix is 506 x 14, but lda.fit_sb$scaling has 15 rows, so the matrix product is non-conformable. The extra row comes from the duplicated crim column visible in the LDA output (crim and crime_sb have identical group means and coefficients, hence also the collinearity warning). Dropping that duplicate before fitting the LDA should make the dimensions match.
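My best guess at the mechanism, shown on toy data independent of the assignment objects: after crim is dropped from the data frame, R's `$` operator partially matches crim against crime_sb, so the "sanity" assignment silently re-creates crim as a duplicate column.

```r
# $ partial matching: 'crim' matches 'crime_sb' when no exact 'crim' column
# exists, so this pattern silently duplicates the column.
df <- data.frame(crime_sb = 1:3)
is.null(df$crim)                 # FALSE: partial match returns crime_sb
df$crim <- as.numeric(df$crim)   # creates a new 'crim' column, identical to crime_sb
ncol(df)                         # 2
identical(df$crim, as.numeric(df$crime_sb))  # TRUE
```

Using exact indexing such as df[["crim"]] (which does not partial-match by default) would have returned NULL and exposed the mistake immediately.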


2. Data wrangling (max 5 points)

This code is for the next week’s data!

See create_human.R and human.csv


End of assignment 4